Towards Cultural-Scale Models of Full Text
Abstract
In this preliminary study, we examine whether random samples from within given Library of Congress Classification Outline areas yield significantly different topic models. We find that models of subsamples can equal the topic similarity of models over the whole corpus. As the sample size increases, topic distance decreases and topic overlap increases. The requisite subsample size differs by field and by number of topics. While this study focuses on only five areas, we find significant differences in the behavior of these areas that can only be investigated with large corpora like the HathiTrust.

Keywords: digital libraries; topic modeling; topic alignment; random sampling; HathiTrust; Library of Congress Classification Outline (LCCO).

Large-scale digital libraries, such as the HathiTrust¹, give a window into a much greater quantity of textual data than ever before [1]. These data raise new challenges for analysis and interpretation. The constant, dynamic addition and revision of works in digital libraries mean that any study aiming to characterize the evolution of culture using large-scale digital libraries must be aware of the implications of corpus sampling. Failing to recognize that any large digital library is merely a sample of a larger set of books published within the culture can lead to unintentionally strong claims about socio-linguistics [2]. New methodologies also require careful consideration of their humanistic implications [3].

One methodology with rapid uptake in the study of cultural evolution is probabilistic topic modeling [4]. Topic modeling has been used to characterize the evolution of literary diction [5] and of literary studies [6], and to search large corpora for “the great unread” [7]. Moreover, topic modeling is an integral part of the HathiTrust Research Center (HTRC)’s Data Capsule [8, 9].
Researchers need confidence in the sampling methods used to construct topic models intended to represent very large portions of the HathiTrust collection. For example, topic modeling every book categorized under the Library of Congress Classification Outline (LCCO)² as “Philosophy” (call numbers B1-5802) is impractical, as any library will be incomplete. However, if it can be shown that models built from different random samples are highly similar to one another, then the project of building a topic model that is sufficiently representative of the entire HathiTrust (HT) collection may become tractable.

¹ http://hathitrust.org/
² https://www.loc.gov/catdir/cpso/lcco/

arXiv:1512.05004v1 [cs.DL] 15 Dec 2015

Methods

LCCO Sampling

We implemented a random sampling web service that provides the following query interfaces:

• sampling(Category, number): takes a category and the number of random samples as input and returns a list of randomly selected book IDs. For example, given the category “DQ78-210” and the number 3, the web service returns “gri.ark:/13960/t50g6cm2v|uc1.31822038210555|uva.x030577307”, where the book IDs are separated by the pipe symbol.
• id(Category): takes a category as input and returns a list of the IDs of all books under that category.
• idTotal(Category): takes a category as input and returns the total number of books under that category.

Figure 1 shows an example of the LCCO hierarchy stored in the HathiTrust Solr Index.

Corpus Download

Each subject area was downloaded from the HathiTrust on 19 October 2015 using the HathiTrust Data API through the InPhO Topic Explorer interface. The selected areas can be found in Table 1.

Topic Modeling

LDA topic modeling [10] represents the current state of the art for extracting meaningful data from digitized texts. We use the implementation of LDA embedded within the InPhO Topic Explorer [11], which uses collapsed Gibbs sampling for topic estimation [12].
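The three query interfaces listed under LCCO Sampling can be mimicked locally to make their contracts concrete. The following is a minimal sketch, not the web service itself: the in-memory CATALOGUE, the function names book_ids and id_total (standing in for id and idTotal), the seed parameter, and the two extra book IDs are all assumptions made for illustration. Only the category “DQ78-210”, the first three IDs, and the pipe-separated return format come from the text.

```python
import random

# Hypothetical stand-in for the HathiTrust Solr index: maps an LCCO
# category to the book IDs filed under it. The last two IDs are made up.
CATALOGUE = {
    "DQ78-210": [
        "gri.ark:/13960/t50g6cm2v",
        "uc1.31822038210555",
        "uva.x030577307",
        "example.0001",
        "example.0002",
    ],
}


def id_total(category):
    """Counterpart of idTotal(Category): number of books in the category."""
    return len(CATALOGUE.get(category, []))


def book_ids(category):
    """Counterpart of id(Category): all book IDs, pipe-separated."""
    return "|".join(CATALOGUE.get(category, []))


def sampling(category, number, seed=None):
    """Counterpart of sampling(Category, number).

    Draws up to `number` distinct book IDs uniformly at random, without
    replacement, and returns them pipe-separated. `seed` is an added
    convenience for reproducible draws, not part of the described service.
    """
    books = CATALOGUE.get(category, [])
    rng = random.Random(seed)
    k = min(number, len(books))
    return "|".join(rng.sample(books, k))


if __name__ == "__main__":
    print(id_total("DQ78-210"))
    print(sampling("DQ78-210", 3, seed=42))
```

Sampling without replacement is a design choice of this sketch; the text does not say whether the service can return the same book twice.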
All corpora have the NLTK English stoplist removed. Additionally, all words occurring more than 50000 times or fewer than 15 times are removed from the corpus. A reference model is trained on the whole subject area. Multiple other spanning models are also trained on the whole subject area. For this preliminary study, we do not select the reference model from the spanning models, but research on model checking [4] and model selection [13] provides guidelines for further research. Finally, multiple subcorpus models are trained on different, randomly selected portions of the whole corpus.

Topic alignment

A topic alignment is a function that maps one topic in M1 to a topic in M2. For this analysis, M2 is always the reference model, while M1 is either a spanning or subcorpus model. The alignment function is not

LCCO          Subject Heading      # HT Vols
nb            Art Sculpture        801
tg            Bridge Engineering   799
bj71-1185     History of Ethics    898
hd6050-6305   Classes of Labor     1255
tn600-799     Metallurgy           938

Table 1: LCCO Areas Sampled. Areas in the Library of Congress Classification Outline (LCCO) and their representation in the HathiTrust Digital Library.
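To illustrate what a topic alignment between a subcorpus model M1 and the reference model M2 might look like, here is a minimal sketch. It rests on assumptions the text does not confirm: topics are represented as word-probability lists over a shared vocabulary, each M1 topic is mapped to its nearest M2 topic under Jensen-Shannon distance, and overlap is taken as the fraction of reference topics that receive at least one aligned topic. The function names and the toy distributions are hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero mass contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    # max() guards against tiny negative values from floating-point rounding.
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def align(m1, m2):
    """Map each topic index in m1 to the index of its nearest topic in m2.

    Assumed alignment rule: nearest neighbour under js_distance. Several
    m1 topics may map to the same m2 topic.
    """
    return [
        min(range(len(m2)), key=lambda j: js_distance(topic, m2[j]))
        for topic in m1
    ]


def overlap(alignment, n_reference_topics):
    """Fraction of reference topics hit by at least one aligned topic."""
    return len(set(alignment)) / n_reference_topics


if __name__ == "__main__":
    # Toy models over a three-word vocabulary (made-up numbers).
    reference = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
    subcorpus = [[0.6, 0.3, 0.1], [0.65, 0.25, 0.1]]
    mapping = align(subcorpus, reference)
    print(mapping, overlap(mapping, len(reference)))
```

In the toy example both subcorpus topics resemble the first reference topic, so only half of the reference topics are covered; under this definition, overlap rises as a sample model recovers more of the reference model's distinct topics.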
Journal title:
- CoRR
Volume: abs/1512.05004
Pages: -
Publication date: 2015